Training of neural networks is a computationally intensive task. The significance of understanding and modeling the training dynamics is growing as increasingly larger networks are being trained. We propose in this work a model based on the correlation of the parameters' dynamics, which dramatically reduces the dimensionality. We refer to our algorithm as \emph{correlation mode decomposition} (CMD). It splits the parameter space into groups of parameters (modes) which behave in a highly correlated manner through the epochs. We achieve a remarkable dimensionality reduction with this approach, where networks like ResNet-18, transformers and GANs, containing millions of parameters, can be modeled well using just a few modes. We observe that the typical time profile of each mode is spread throughout the network, across all layers. Moreover, our model induces regularization which yields better generalization on the test set. This representation enhances the understanding of the underlying training dynamics and can pave the way for designing better acceleration techniques.
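To make the idea concrete, below is a minimal sketch of how parameters could be grouped into correlated modes from a sequence of training checkpoints. The (T, P) snapshot layout, the use of k-means over correlation rows, and the mean-trajectory mode profiles are illustrative assumptions, not the paper's exact CMD procedure; the full P-by-P correlation matrix is only feasible for small parameter counts.

```python
# Sketch: group parameters into correlated "modes" from training snapshots.
import numpy as np
from sklearn.cluster import KMeans

def correlation_modes(snapshots: np.ndarray, n_modes: int = 4):
    """snapshots: (T, P) array; row t holds the flattened parameters at checkpoint t."""
    # Standardize each parameter's trajectory over time.
    traj = snapshots - snapshots.mean(axis=0, keepdims=True)
    traj = traj / (traj.std(axis=0, keepdims=True) + 1e-8)
    # Pairwise correlation between parameter trajectories (P x P) --
    # only feasible for small P; a real implementation would subsample.
    corr = (traj.T @ traj) / snapshots.shape[0]
    # Group parameters with similar correlation patterns (assumed clustering step).
    labels = KMeans(n_clusters=n_modes, n_init=10).fit_predict(corr)
    # Represent each mode by the mean time profile of its member parameters.
    profiles = np.stack([snapshots[:, labels == k].mean(axis=1)
                         for k in range(n_modes)])
    return labels, profiles  # shapes: (P,), (n_modes, T)

# Toy usage: 50 checkpoints of a 200-parameter "network".
rng = np.random.default_rng(0)
toy = rng.normal(size=(50, 200)).cumsum(axis=0)
labels, profiles = correlation_modes(toy, n_modes=3)
```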
The attention mechanism is considered the backbone of the widely-used Transformer architecture. It contextualizes the input by computing input-specific attention matrices. We find that this mechanism, while powerful and elegant, is not as important as typically thought for pretrained language models. We introduce PAPA, a new probing method that replaces the input-dependent attention matrices with constant ones -- the average attention weights over multiple inputs. We use PAPA to analyze several established pretrained Transformers on six downstream tasks. We find that without any input-dependent attention, all models achieve competitive performance -- an average relative drop of only 8% from the probing baseline. Further, little or no performance drop is observed when replacing half of the input-dependent attention matrices with constant (input-independent) ones. Interestingly, we show that better-performing models lose more from applying our method than weaker models, suggesting that the utilization of the input-dependent attention mechanism might be a factor in their success. Our results motivate research on simpler alternatives to input-dependent attention, as well as on methods for better utilization of this mechanism in the Transformer architecture.
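As a rough illustration of replacing input-dependent attention with a constant matrix, here is a minimal single-head sketch; averaging over a handful of probe inputs and padding to a fixed maximum length are assumptions for the example, not the PAPA implementation itself.

```python
# Sketch: build a constant attention matrix as the average of attention
# weights computed over several probe inputs.
import torch
import torch.nn.functional as F

def attention_weights(q, k):
    """Standard scaled dot-product attention weights, shape (L, L)."""
    return F.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)

def constant_attention(probe_qs, probe_ks, max_len):
    """Element-wise average of attention weights over the probe inputs."""
    acc = torch.zeros(max_len, max_len)
    count = torch.zeros(max_len, max_len)
    for q, k in zip(probe_qs, probe_ks):
        L = min(q.shape[0], max_len)
        acc[:L, :L] += attention_weights(q[:L], k[:L])
        count[:L, :L] += 1
    # Rows could be re-normalized if probe lengths differ.
    return acc / count.clamp(min=1)

# During probing, a layer's output for a length-L input becomes
#   const_attn[:L, :L] @ v
# instead of attention_weights(q, k) @ v, so no per-input attention is computed.
```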
In this paper, we propose an innovative transfer learning approach for time series classification. Instead of using existing datasets from the UCR archive as source datasets, we generated 15,000,000 synthetic univariate time series using our own synthetic time series generator algorithm, which can produce data with diverse patterns, trend angles, and sequence lengths. Moreover, rather than using the classification tasks provided by the UCR archive as source tasks, as in previous studies, we use 55 regression tasks of our own as source tasks, which yields better transfer than selecting classification tasks from the UCR archive.
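A minimal sketch of a synthetic univariate time series generator in this spirit is shown below; the pattern family (trend plus seasonality plus noise), the parameter ranges, and using the trend slope as the regression label are illustrative assumptions, not the authors' generator or their 55 source tasks.

```python
# Sketch: generate synthetic univariate time series with varied patterns,
# slopes, and lengths, paired with a regression target (here, the slope)
# so the series can serve as a source regression task for pretraining.
import numpy as np

def synth_series(rng, min_len=60, max_len=512):
    n = rng.integers(min_len, max_len + 1)          # variable sequence length
    t = np.linspace(0.0, 1.0, n)
    slope = rng.uniform(-2.0, 2.0)                  # trend "angle"
    freq = rng.uniform(1.0, 8.0)                    # seasonal pattern
    amp = rng.uniform(0.0, 1.0)
    noise = rng.normal(scale=0.1, size=n)
    x = slope * t + amp * np.sin(2 * np.pi * freq * t) + noise
    return x.astype(np.float32), np.float32(slope)  # (series, regression label)

rng = np.random.default_rng(42)
dataset = [synth_series(rng) for _ in range(1000)]  # scale up as needed
```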
Profile-guided optimization is an effective technique for improving a compiler's ability to optimize based on dynamic behavior, but collecting profile data is expensive and cumbersome, and the data must be refreshed regularly to stay useful. We present a novel statistical approach to inferring branch probabilities that improves the performance of programs compiled without profile information. We train offline on information collected from a large corpus of binaries that carry branch probability information. The compiler uses the learned model to predict branch probabilities for ordinary, un-profiled programs, which it can then use to inform optimization decisions. We integrate our technique directly into LLVM, complementing the existing hand-engineered compiler heuristics. We evaluate our technique on a suite of benchmarks, demonstrating some gains over programs compiled without profile information. In deployment, our technique requires no profiling runs and has a negligible effect on compilation time.
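A toy sketch of learning branch probabilities from static code features follows; the feature set, the gradient-boosted-trees regressor, and the tiny hand-written data are assumptions for illustration, not the actual model or its LLVM integration.

```python
# Sketch: predict a branch's taken-probability from static features,
# trained offline on data gathered from profiled binaries.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Toy training data: rows of static branch features, labels are the
# taken-probabilities observed in profiled binaries (illustrative values).
X_train = np.array([
    # [is_loop_backedge, loop_depth, cmp_with_zero, has_call_in_then]
    [1, 2, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
], dtype=float)
y_train = np.array([0.95, 0.30, 0.90, 0.45])

model = GradientBoostingRegressor().fit(X_train, y_train)

# At compile time, with no profile available, predict and clip to [0, 1].
new_branch = np.array([[1, 3, 0, 0]], dtype=float)
prob_taken = float(np.clip(model.predict(new_branch)[0], 0.0, 1.0))
```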
We obtain a personal signature of a person's learning progress in a self-neuromodulation task guided by functional MRI (fMRI). The signature is based on the activity of the amygdala in a second neurofeedback session, given the first session. The prediction is performed by a deep neural network trained on the patients of the entire training cohort. This signal, which indicates a person's progress in performing the amygdala-modulation task, is clustered into multiple prototypical brain states, which are then classified by a linear classifier with respect to various personal and clinical indications. The predictive power of the obtained signature is stronger than that of previous methods for obtaining a personal signature from fMRI neurofeedback, and it provides an indication that a person's learning pattern may serve as a diagnostic tool. Our code has been made available, and the data are shared subject to ethics approval.
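A schematic sketch of the last two stages described above (clustering progress signals into prototype states, then linear classification) follows; the dimensions, number of prototypes, distance-to-centroid features, and logistic regression classifier are illustrative assumptions, not the study's exact pipeline.

```python
# Sketch: cluster per-subject progress signals into prototype states and
# classify subjects from their distances to those prototypes.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
signals = rng.normal(size=(40, 16))   # 40 subjects x 16-dim progress signal (toy)
labels = rng.integers(0, 2, size=40)  # toy binary clinical indication per subject

# Prototype "brain states": centroids of the predicted progress signals.
km = KMeans(n_clusters=5, n_init=10).fit(signals)
# Signature: distance of each subject's signal to every prototype.
signature = km.transform(signals)
clf = LogisticRegression(max_iter=1000).fit(signature, labels)
```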